What assumptions does a GWAS make about a population?
What happens when these assumptions are not met?
When you have completed this module, you will know:
… how to summarize the assumptions (biological and statistical) that a GWAS makes about the individuals in a study and the structure of the data.
… how to identify instances when your data show evidence of not meeting these assumptions.
… what tools are available for analyzing complex data that do not meet the basic GWAS assumptions.
… how to implement the tools mentioned above using R
The starting point for the computation in this module depends on what you did in the previous module (imputation). If your data was too big to read into your computer’s memory, read the section labeled “PCA for large/not fully imputed data” for your computational methods. If you were able to work with a matrix of all your SNP data, refer to the “PCA for fully-imputed data” for computational methods.
Remember: “fully imputed” means that there are no missing values in the data set.
As a start, load the packages we will need for this module:
# Load our packages (same ones mentioned in the data module)
# two pacakges are from Bioconductor
library(snpStats)
library(SNPRelate)
# all other packages are on CRAN
library(data.table)
library(dplyr)
library(magrittr)
library(ggplot2)
library(ncvreg)
library(RSpectra)
# register cores for parallel processing
library(doSNOW)
registerDoSNOW(makeCluster(4))
In a GWAS context, population structure is a kind of relatedness among the individuals represented in the data set. When considering relatedness, it is helpful to define the phrase identical by descent. Two alleles from the same locus are identical by descent if they can be traced back from the same allele in an earlier generation. For more information on this concept, check out this example.
Population structure is defined by the existence of allele frequency differences that characterize sub-populations and is driven by the combined effects of evolutionary processes such as genetic drift, migration, demographic history, and natural selection.
Population structure is a common phenomenon in genetic data. Varying levels of relatedness are almost always present among genetic samples, even in samples of unrelated individuals and seemingly homogeneous populations. For example, European American, Han Chinese, and recently, cohorts within the UK Biobank data have been shown to exhibit patterns of population and geographic structure despite their seemingly similar subjects.
Population structure is broadly categorized based on whether it describes recent or ancient relatedness. Ancient relatedness describes the presence of a common ancestor many generations previously. The presence of distinct ancestry groups with different allele frequencies in a sample is known as population stratification (a.k.a ancestry differences). Recent relatedness describes the sharing of a common ancestor only several generations previously. Pedigree-based methods may be used to explicitly model recent relatedness if familial relationships are known. In the absence of known familial relationships, recent relatedness is referred to as cryptic relatedness. 1
Population stratification in particular has been of great concern in genetic studies due to its potential to lead to spurious associations when population structure is associated with differences in both allele frequency and the trait or disease.
As an example, consider a GWAS to assess genetic variants associated with lung cancer in a sample comprised of subjects from two distinct subpopulations, A and B. Assume the minor allele of SNP X is present with higher frequency in subpopulation A compared to subpopulation B, but has no direct effect on lung cancer. Also suppose these subpopulations are geographically segregated in a such a way that subpopulation A is exposed to good air quality, and subpopulation B to poor air quality, and that air quality has a direct effect on lung cancer. A GWAS of data from these subpopulations would find SNP X to be significantly associated with lung cancer, even though we know it has no effect on lung cancer. If subpopulations A and B were not subject to different air qualities, all else being equal, SNP X would not be found to be associated with the phenotype.
Another apocryphal but illustrative example of how population stratification can lead to confounding in genetic studies is linked below.
GWAS (and many statistical tests) assume samples are independent. Cryptic relatedness and population structure can invalidate these tests since the presence of non-independent samples, and thus non-independent errors, can lead to inflated test statistics.
With this in mind, it is critical to evaluate and account for potential population structure in our data in order to avoid false positives and negatives.
One of the simplest and most common methods used to assess and correct for population structure in our data is with principle component analysis (PCA). Here, I will present a ‘crash course’ on the concepts of PCA and provide an example.
Conceptually, PCA can be thought of as extracting the axes of greatest variability in our data. PCA creates linear combinations of a given set of variables; this lets the user represent a lot of correlated variables with a smaller number of uncorrelated variables - so PCA gives us dimension reduction, which is really helpful for high dimensional settings like GWAS data.
Connections to different conceptual frameworks:
From a linear modeling framework, PCA is an orthogonal linear transformation of a data set.
From a machine learning framework, PCA is a type of unsupervised learning.
As a brief example with a smaller data set, I’m going to illustrate some of these concepts using some smaller data with known structure.
First, we’ll read in the small data and filter out the monomorphic SNPs. Note that this data set contains a known race variable.
# read in smaller data
smaller_data <- read.delim("https://s3.amazonaws.com/pbreheny-data-sets/admixture.txt")
# see what it looks like
smaller_data[1:5, 1:5]
# Race Snp1 Snp2 Snp3 Snp4
# 1 African American 0 0 0 0
# 2 African American 0 0 0 0
# 3 African American 0 0 0 0
# 4 African American 0 0 0 0
# 5 African American 1 0 0 1
# assign the 'race' variable
race <- smaller_data$Race
# create a matrix with only SNP data
all_SNPs <- as.matrix(smaller_data[,-1])
# filter out monomorphic SNPs
polymorphic <- apply(all_SNPs, 2, sd) != 0
SNPs<- all_SNPs[,polymorphic]
# look at the resulting matrix dimensions
dim(SNPs)
# [1] 197 98
# see what the data set looks like in terms of racial categories
table(race)
# race
# African African American European Japanese
# 50 47 50 50
Next, I’m going to compute the principal components (PCs) and plot a scree plot. A scree plot tells us the proportion of variance explained by each of the PCs. PCA is such that the PCs are ordered from those that explain the greatest to least amount of variability. Scree plots are sometimes used to decide how many PCs are appropriate to include. We look for the ‘elbow’ in the plot, which indicates when the proportion of variance explained by including additional PCs may not be worth the extra df. There are also a number of more complex tests and tools to determine this as well, or often the top 10 PCs are simply used in practice.
# use the convenient prcomp() function to do all the PCA steps in one line!
pca <- prcomp(SNPs, center = TRUE, scale = TRUE)
# look at the results
pca$x[1:5, 1:5]
# PC1 PC2 PC3 PC4 PC5
# [1,] 1.6652335 -0.3909691 -0.7799540 -2.3234440 -2.0802964
# [2,] 1.3048975 -2.2997440 -0.4810622 1.9478506 1.6806863
# [3,] 3.2633748 -0.1948454 -0.6283735 2.7211461 0.9575737
# [4,] 0.5371343 -1.5322968 0.6324284 -0.4261897 2.3696399
# [5,] 3.3810463 -1.2142128 -1.8344268 -0.8499357 -1.6181324
# plot the top 10 PCs in a scree plot
plot(x = 1:10,
y = 100 * proportions(pca$sdev[1:10]^2),
type = 'b',
ylab = 'Proportion of variance explained',
xlab = 'PC',
main = 'Example Scree Plot')
We see this elbow point at 3 PCs, which makes sense, given that there are 4 distinct races in this data set (we lose one degree of freedom when we center our data). The first PC explains about \(30 \%\) of the variance in our data, while the second PC explains a little less than \(20 \%\). Together, this means that the first 3 PCs explain over half of the variance in the data set.
Now we’ll plot the first two PCs against each other, and color the data points by the known race of each subject. We expect to see clustering that corresponds to population structure.
pca_dat <- data.frame(race = race, PC1 = pca$x[,1], PC2 = pca$x[, 2])
pca_plot <- ggplot(pca_dat, aes(x = PC1, y = PC2, col = race)) +
geom_point() +
coord_fixed()
plot(pca_plot)
Indeed, we see that the first two PCs differentiate the racial groups, which cluster together. We can see that PC1 seems to differentiate the European/Japanese populations from the African/African American ones, while PC2 seems to primarily differentiate the European and Japanese populations. Even if we did not have the known race vector to color the points, we could still pick out some clustering from this plot, which indicates underlying structure. If there were no underlying structure in our data, we would expect to see no clustering or systematic pattern in this plot.
As a counter example to this plot, we can plot the PCs from a random matrix (i.e. a matrix with no population structure):
# for the sake of illustration, I'm going to use the same dimensions as those
# in the smaller data in the previous example.
n <- 197
p <- 98
X <- matrix(rnorm(n * p), n, p)
pca_rand <- prcomp(X, center = TRUE, scale = TRUE)
rand_dat <- data.frame(PC1 = pca_rand$x[,1], PC2 = pca_rand$x[, 2])
rand_plot <- ggplot(rand_dat,
aes(x = PC1, y = PC2)) +
geom_point() +
coord_fixed()
plot(rand_plot)
We see that in this plot, there is underlying structure = no clustering or pattern. If I saw results like this “in the wild”, I would be fairly confident that population structure is not a major issue in my data set.
Now, let’s go back to our running example with the large SNP data set. For our case, let’s call our SNP data \(X\). I can think of \(X\) as a matrix, where I have \(n\) rows (in this case, n = 1401) and \(p\) is the number of columns (in this case, p = 752,675). Since we have \(p >> n\), we cannot simply use the spectral decomposition to calculate the eigenvalues of the covariance matrix \(\mathbf{S} \equiv \tilde{\mathbf{X}}^T \tilde{\mathbf{X}}\), since this \(\mathbf{S}\) will not be positive definite (and it will be HUGE!). Instead, we will use a method that exploits the power of the singular value decomposition:
Standardize (center and scale) the data \(\mathbf{X}\), and call the resulting matrix \(\tilde{\mathbf{X}}\).
Calculate \(\tilde{\mathbf{X}} \equiv \mathbf{U}\mathbf{D}\mathbf{V}^T\) using the singular value decomposition (SVD). Here, \(\mathbf{U}\) and \(\mathbf{V}\) are both orthogonal matrices, and \(\mathbf{D}\) is a diagonal matrix of non-negative numbers. We say that \(\mathbf{U}\) and \(\mathbf{V}\) contain the left and right singular vectors, respectively. The values on the diagonal of \(\mathbf{D}\) are the singular values.
We can calculate the principal components using the vectors of \(\mathbf{U}\), which is super cool! This is possible because the columns of \(\mathbf{U}\) are eigenvectors of the covariance matrix \(\mathbf{S}\). Specifically, the \(i^{th}\) principal component \(p_i\) can be calculated as \(p_i = \tilde{\mathbf{X}}u_i\), where \(u_i\) is the \(i^{th}\) column of \(\mathbf{U}\).
In other words, using SVD gives us a way to find the eigenvalues of \(\mathbf{S}\) that does not involve finding its spectral decomposition.
We will put these steps into practice now - if you have saved the fully imputed data set, load it here:
X <- readRDS(file = "data/fully_imputed_numeric.rds")
Remember, we cannot use the prcomp() function on our SNP data set - it is too big. We will need some other tools:
I will use the function std from the ncvreg package to standardize \(\mathbf{X}\)
I will use the svds() function from the RSpectra package do implement the singular value decomposition
Since we are usually only interested in the first few PCs, I will calcuate just the first 4 PCs in this example.
# Center and scale the design matrix
X_scaled <- ncvreg::std(X)
# remove unscaled X object to save memory
rm(X)
# use singular value decomposition to derive the left singular vectors
# NB: on my laptop, this took about 10 minutes to run
singular_vectors <- svds(t(X_scaled),
k = 4,
nu = 4, # NB: default is nu = k
nv = 0)
# check the dimension of the returned "u" matrix - we should have ncol(u) = k
dim(singular_vectors$u)
# [1] 752675 4
# calculate the PCs
PCs <- X_scaled %*% singular_vectors$u
# find the standard deviations of each PC
PC_sds <- apply(X = PCs,
MARGIN = 2, # columns are PCs
FUN = sd)
# create a scree plot
plot(x = 1:ncol(PCs),
y = 100 * proportions(PC_sds^2),
type = 'b',
ylab = 'Proportion of variance explained',
xlab = 'PC',
main = 'Scree Plot: SNP data')
Here, we see that the first PC explains over half of the variance in the data set. This indicates that there is probably evidence of population structure in this data set. Based on this, I expect that a plot of the principal components themselves will show strong clustering.
# create a plot of the first two PCs
PC_dat <- as_tibble(PCs) # make a data set (better than a matrix for plotting)
names(PC_dat) <- paste0("PC", 1:ncol(PCs))
# plot these principal components
PC_dat %>%
ggplot(aes(x = PC1, y = PC2)) +
geom_point() +
coord_fixed()
We notice in the plot above that there is indeed notable clustering - we have evidence that our SNP data does have population structure. We will need to keep this in mind as we move forward to analysis, as this will guide our choice(s) of tools for analysis and interpretation.
Similar to examining racial group in our example PCA, we can use the clinical features in our SNP data set to see if there is any relationship between these clinical factors and the clustering we see in our PCA.
# load our clinical data
clinical <- fread('data/GWAStutorial_clinical.csv') %>%
mutate(across(.cols = c(sex, CAD),
.fns = as.factor))
# remind ourselves of what clinical/demographic information we have
head(clinical)
# FamID CAD sex age tg hdl ldl
# 1: 10002 1 1 60 NA NA NA
# 2: 10004 1 2 50 55 23 75
# 3: 10005 1 1 55 105 37 69
# 4: 10007 1 1 52 314 54 108
# 5: 10008 1 1 58 161 40 94
# 6: 10009 1 1 59 171 46 92
# plot these principal components, using color codes to differentiate sexes
PC_dat %>%
ggplot(aes(x = PC1, y = PC2,
col = clinical$sex)) +
geom_point() +
coord_fixed() +
theme(legend.position = "none")
Based on the plot above, sex does not seem to be associated with the clustering we see in the principal components.
For really large data (or data that did not need imputation), we’ll be using tools from the package SNPRelate which allow us to do key computations without having a fully imputed SNP data set. To use SNPRelate functions, we need to get our data in a GDS format. SNPRelate has a function snpgdsBED2GDS() to convert PLINK binary data into a GDS file, bed/bim/fam \(\longrightarrow\) GDS, but as far as I am aware, there is no tool to convert SnpMatrix objects to GDS, SnpMatrix \(\centernot\longrightarrow\). So, we need to first convert our qc data back to bed/bim/fam using the snpStats function write.plink(), and then use that to create a GDS file: SnpMatrix \(\longrightarrow\) bed/bim/fam \(\longrightarrow\) GDS.
I’ll re-load the qc-data from the quality control module.
qc_data <- readRDS('data/gwas-qc.rds')
write.plink(
file.base = "data/qc_data",
snps = qc_data$genotypes,
pedigree = qc_data$fam$pedigree,
id = qc_data$fam$member,
father = qc_data$fam$father,
mother = qc_data$fam$mother,
sex = clinical$sex,
phenotype = clinical$CAD + 1,
chromosome = qc_data$map$chromosome,
genetic.distance = qc_data$map$cM,
position = qc_data$map$position,
allele.1 = qc_data$map$allele.1,
allele.2 = qc_data$map$allele.2
)
If we wanted to spend the time using the mode imputation and rounding to get a qc-data set with absolutely no missingness, we could do bed/bim/fam/ \(\longrightarrow\) SnpMatrix bed/bim/fam/ \(\longrightarrow\) GDS (this is annoying).
We will now create our GDS file, use SNPRelate to compute the PCs, and plot them to see if it looks like there is any kind of sample structure as we saw in the previous example. We don’t have a known race or subpopulation status vector to color this plot, but we can look for clustering.
# create gds file so we can use SNPRelate - using unimputed qc data
qc_data.fn <- lapply(c(bed='bed', bim='bim', fam='fam', gds='gds'), function(x) paste0('./data/qc_data.', x))
# IMHO, this next function is poorly named, but it will serve our purposes here...
snpgdsBED2GDS(qc_data.fn$bed, qc_data.fn$fam, qc_data.fn$bim, qc_data.fn$gds)
# Start file conversion from PLINK BED to SNP GDS ...
# BED file: './data/qc_data.bed'
# SNP-major mode (Sample X SNP), 252.0M
# FAM file: './data/qc_data.fam'
# BIM file: './data/qc_data.bim'
# Sat Jun 4 12:34:09 2022 (store sample id, snp id, position, and chromosome)
# start writing: 1401 samples, 752675 SNPs ...
#
[..................................................] 0%, ETC: ---
[==================================================] 100%, completed, 21s
# Sat Jun 4 12:34:30 2022 Done.
# Optimize the access efficiency ...
# Clean up the fragments of GDS file:
# open the file './data/qc_data.gds' (255.5M)
# # of fragments: 39
# save to './data/qc_data.gds.tmp'
# rename './data/qc_data.gds.tmp' (255.5M, reduced: 252B)
# # of fragments: 18
# open the GDS file
genofile <- snpgdsOpen(qc_data.fn$gds)
# get PCs
pca <- snpgdsPCA(genofile)
# Principal Component Analysis (PCA) on genotypes:
# Excluding 0 SNP on non-autosomes
# Excluding 0 SNP (monomorphic: TRUE, MAF: NaN, missing rate: NaN)
# # of samples: 1,401
# # of SNPs: 752,675
# using 1 thread
# # of principal components: 32
# PCA: the sum of all selected genotypes (0,1,2) = 457958563
# CPU capabilities: Double-Precision SSE2
# Sat Jun 4 12:34:35 2022 (internal increment: 532)
#
[..................................................] 0%, ETC: ---
[==================================================] 100%, completed, 4.2m
# Sat Jun 4 12:38:49 2022 Begin (eigenvalues and eigenvectors)
# Sat Jun 4 12:38:50 2022 Done.
# close the file
snpgdsClose(genofile)
# plot
plot(1:10, pca$varprop[1:10], type = 'b', ylab = 'Proportion of variance explained', xlab = 'PC')
# put top 10 pcs in a table
pctab <- data.frame(sample.id = pca$sample.id,
pca$eigenvect[, 1:10],
stringsAsFactors = FALSE)
names(pctab)[-1] <- paste0('PC', 1:10)
# plot
plot(pctab$PC2, pctab$PC1, xlab="Principal Component 2", ylab="Principal Component 1")
I want to thank Prof. Brian Smith - my brief explanation of principal component analysis (including all my notation) is based on his PCA lecture in his Machine Learning course (Spring 2022). If you are interested in doing more with machine learning, I highly recommend his R package MachineShop.
NB: definitions of the terms recent and ancient are somewhat subjective and hand-wavy since, in theory, if you look back far enough, everyone shares a common ancestor. However, the idea is that humans migrated, separated, and mated such that over time distinct groups developed allele frequencies different enough to confound an analysis.↩︎